AITopics | human action recognition

Collaborating Authors

human action recognition

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Unsupervised Learning of View-invariant Action Representations

Junnan Li, Yongkang Wong, Qi Zhao, Mohan Kankanhalli

Neural Information Processing SystemsFeb-12-2026, 13:18:16 GMT

Neural Information Processing Systems http://nips.cc/

action recognition, recognition, representation, (15 more...)

Neural Information Processing Systems

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Singapore > Central Region > Singapore (0.04)
North America > Canada > Quebec > Montreal (0.04)

Genre: Research Report (0.68)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Unsupervised or Indirectly Supervised Learning (0.85)

Add feedback

Unsupervised Learning of View-invariant Action Representations

Junnan Li, Yongkang Wong, Qi Zhao, Mohan Kankanhalli

Neural Information Processing SystemsNov-20-2025, 15:27:47 GMT

Recognizing human action in videos is a long-standing research problem in computer vision.

artificial intelligence, machine learning, representation, (17 more...)

Neural Information Processing Systems

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Singapore > Central Region > Singapore (0.04)
North America > Canada > Quebec > Montreal (0.04)

Genre: Research Report (0.68)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Unsupervised or Indirectly Supervised Learning (0.85)

Add feedback

Grounding Foundational Vision Models with 3D Human Poses for Robust Action Recognition

Babey, Nicholas, Gu, Tiffany, Li, Yiheng, Meo, Cristian, Zhu, Kevin

arXiv.org Artificial IntelligenceNov-11-2025

For embodied agents to effectively understand and interact within the world around them, they require a nuanced comprehension of human actions grounded in physical space. Current action recognition models, often relying on RGB video, learn superficial correlations between patterns and action labels, so they struggle to capture underlying physical interaction dynamics and human poses in complex scenes. We propose a model architecture that grounds action recognition in physical space by fusing two powerful, complementary representations: V-JEPA 2's contextual, predictive world dynamics and CoMotion's explicit, occlusion-tolerant human pose data. Our model is validated on both the InHARD and UCF-19-Y-OCC benchmarks for general action recognition and high-occlusion action recognition, respectively. Our model outperforms three other baselines, especially within complex, occlusive scenes. Our findings emphasize a need for action recognition to be supported by spatial understanding instead of statistical pattern recognition.

artificial intelligence, machine learning, pattern recognition, (14 more...)

arXiv.org Artificial Intelligence

2511.05622

Country: North America > United States (0.28)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition (0.70)
Information Technology > Artificial Intelligence > Representation & Reasoning > Information Fusion (0.70)
Information Technology > Artificial Intelligence > Robots > Humanoid Robots (0.50)

Add feedback

Biomechanically consistent real-time action recognition for human-robot interaction

Li, Wanchen, Chalabi, Kahina, Maxime, Sabbah, Bousquet, Thomas, Passama, Robin, Ramdani, Sofiane, Cherubini, Andrea, Bonnet, Vincent

arXiv.org Artificial IntelligenceOct-22-2025

This paper presents a novel framework for real-time human action recognition in industrial contexts, using standard 2D cameras. We introduce a complete pipeline for robust and real-time estimation of human joint kinematics, input to a temporally smoothed Transformer-based network, for action recognition. We rely on a new dataset including 11 subjects performing various actions, to evaluate our approach. Unlike most of the literature that relies on joint center positions (JCP) and is offline, ours uses biomechanical prior, eg. joint angles, for fast and robust real-time recognition. Besides, joint angles make the proposed method agnostic to sensor and subject poses as well as to anthropometric differences, and ensure robustness across environments and subjects. Our proposed learning model outperforms the best baseline model, running also in real-time, along various metrics. It achieves 88% accuracy and shows great generalization ability, for subjects not facing the cameras. Finally, we demonstrate the robustness and usefulness of our technique, through an online interaction experiment, with a simulated robot controlled in real-time via the recognized actions.

artificial intelligence, machine learning, real time system, (18 more...)

arXiv.org Artificial Intelligence

2510.18373

Country: Europe > France > Occitanie (0.28)

Genre: Research Report (0.64)

Industry: Health & Medicine (0.47)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Architecture > Real Time Systems (1.00)

Add feedback

d37eb50d868361ea729bb4147eb3c1d8-AuthorFeedback.pdf

Neural Information Processing SystemsAug-16-2025, 14:44:00 GMT

We thank all the reviewers for their valuable comments and appreciation of the ideas and results presented in the paper. We summarize the main questions from the reviewers and address them separately below. T o Reviewer #1 Q1: Network connectivity is presumably known . . . it seems all the graphs considered are com-3 We note that the network connectivity is not assumed to be known. T o Reviewer #3 Q1: Scope of the paper/Missing related work. " and "FedNAS" are about We can add an explanation to clarify the MTL scope of the paper.

agent, byz, experiment, (15 more...)

Neural Information Processing Systems

Technology:

Information Technology > Communications > Networks (0.57)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.31)

Add feedback

Enhanced Sparse Point Cloud Data Processing for Privacy-aware Human Action Recognition

Tunau, Maimunatu, Zakka, Vincent Gbouna, Dai, Zhuangzhuang

arXiv.org Artificial IntelligenceAug-15-2025

Human Action Recognition (HAR) plays a crucial role in healthcare, fitness tracking, and ambient assisted living technologies. While traditional vision based HAR systems are effective, they pose privacy concerns. mmWave radar sensors offer a privacy preserving alternative but present challenges due to the sparse and noisy nature of their point cloud data. In the literature, three primary data processing methods: Density-Based Spatial Clustering of Applications with Noise (DBSCAN), the Hungarian Algorithm, and Kalman Filtering have been widely used to improve the quality and continuity of radar data. However, a comprehensive evaluation of these methods, both individually and in combination, remains lacking. This paper addresses that gap by conducting a detailed performance analysis of the three methods using the MiliPoint dataset. We evaluate each method individually, all possible pairwise combinations, and the combination of all three, assessing both recognition accuracy and computational cost. Furthermore, we propose targeted enhancements to the individual methods aimed at improving accuracy. Our results provide crucial insights into the strengths and trade-offs of each method and their integrations, guiding future work on mmWave based HAR systems

artificial intelligence, dbscan, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2508.10469

Genre: Research Report > New Finding (0.34)

Industry:

Information Technology > Software (0.63)
Health & Medicine > Consumer Health (0.54)
Information Technology > Security & Privacy (0.48)
Information Technology > Smart Houses & Appliances (0.46)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.94)

Add feedback

Advancing Vision-based Human Action Recognition: Exploring Vision-Language CLIP Model for Generalisation in Domain-Independent Tasks

Shandilya, Utkarsh, Kappan, Marsha Mariya, Jain, Sanyam, Sharma, Vijeta

arXiv.org Artificial IntelligenceAug-1-2025

Human action recognition plays a critical role in healthcare and medicine, supporting applications such as patient behavior monitoring, fall detection, surgical robot supervision, and procedural skill assessment. While traditional models like CNNs and RNNs have achieved moderate success, they often struggle to generalize across diverse and complex actions. Recent advancements in vision-language models, especially the transformer-based CLIP model, offer promising capabilities for generalizing action recognition from video data. In this work, we evaluate CLIP on the UCF-101 dataset and systematically analyze its performance under three masking strategies: (1) percentage-based and shape-based black masking at 10%, 30%, and 50%, (2) feature-specific masking to suppress bias-inducing elements, and (3) isolation masking that retains only class-specific regions. Our results reveal that CLIP exhibits inconsistent behavior and frequent misclassifications, particularly when essential visual cues are obscured. To overcome these limitations, we propose incorporating class-specific noise, learned via a custom loss function, to reinforce attention to class-defining features. This enhancement improves classification accuracy and model confidence while reducing bias. We conclude with a discussion on the challenges of applying such models in clinical domains and outline directions for future work to improve generalizability across domain-independent healthcare scenarios.

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2507.18675

Country: Oceania > Australia (0.28)

Genre: Research Report > New Finding (0.88)

Industry:

Health & Medicine > Diagnostic Medicine > Imaging (0.46)
Leisure & Entertainment > Sports > Track & Field (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Variational Graph Convolutional Neural Networks

Oleksiienko, Illia, Kanniainen, Juho, Iosifidis, Alexandros

arXiv.org Artificial IntelligenceJul-3-2025

Estimation of model uncertainty can help improve the explainability of Graph Convolutional Networks and the accuracy of the models at the same time. Uncertainty can also be used in critical applications to verify the results of the model by an expert or additional models. In this paper, we propose Variational Neural Network versions of spatial and spatio-temporal Graph Convolutional Networks. We estimate uncertainty in both outputs and layer-wise attentions of the models, which has the potential for improving model explainability. We showcase the benefits of these models in the social trading analysis and the skeleton-based human action recognition tasks on the Finnish board membership, NTU-60, NTU-120 and Kinetics datasets, where we show improvement in model accuracy in addition to estimated model uncertainties.

artificial intelligence, machine learning, neural network, (16 more...)

arXiv.org Artificial Intelligence

2507.01699

Country: Europe > Finland (0.14)

Genre: Research Report (0.64)

Industry: Banking & Finance > Trading (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Pose Matters: Evaluating Vision Transformers and CNNs for Human Action Recognition on Small COCO Subsets

Tang, MingZe, Kazi, Madiha

arXiv.org Artificial IntelligenceJun-16-2025

This study explores human action recognition using a three-class subset of the COCO image corpus, benchmarking models from simple fully connected networks to transformer architectures. The binary Vision Transformer (ViT) achieved 90% mean test accuracy, significantly exceeding multiclass classifiers such as convolutional networks (approximately 35%) and CLIP-based models (approximately 62-64%). A one-way ANOVA (F = 61.37, p < 0.001) confirmed these differences are statistically significant. Qualitative analysis with SHAP explainer and LeGrad heatmaps indicated that the ViT localizes pose-specific regions (e.g., lower limbs for walking or running), while simpler feed-forward models often focus on background textures, explaining their errors. These findings emphasize the data efficiency of transformer representations and the importance of explainability techniques in diagnosing class-specific failures.

accuracy, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

2506.11678

Genre: Research Report (0.91)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Conformal Predictions for Human Action Recognition with Vision-Language Models

Tim, Bary, Clément, Fuchs, Benoît, Macq

arXiv.org Artificial IntelligenceFeb-10-2025

Human-In-The-Loop (HITL) frameworks are integral to many real-world computer vision systems, enabling human operators to make informed decisions with AI assistance. Conformal Predictions (CP), which provide label sets with rigorous guarantees on ground truth inclusion probabilities, have recently gained traction as a valuable tool in HITL settings. One key application area is video surveillance, closely associated with Human Action Recognition (HAR). This study explores the application of CP on top of state-of-the-art HAR methods that utilize extensively pre-trained Vision-Language Models (VLMs). Our findings reveal that CP can significantly reduce the average number of candidate classes without modifying the underlying VLM. However, these reductions often result in distributions with long tails. To address this, we introduce a method based on tuning the temperature parameter of the VLMs to minimize these tails without requiring additional calibration data. Our code is made available on GitHub at the address https://github.com/tbary/CP4VLM.

artificial intelligence, deep learning, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2502.06631

Country: Europe > Belgium > Wallonia > Walloon Brabant > Louvain-la-Neuve (0.05)

Genre: Research Report > New Finding (0.67)

Industry: Commercial Services & Supplies > Security & Alarm Services (0.34)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback